Abstract



We study a gossip-based algorithm for searching data objects in a multipeer communication
network. All of the nodes in the network are able to communicate with each other. There
exists an initiator node that starts a round of searches by randomly querying one or more of
its neighbours for a desired object. The queried nodes can also be activated and look for the
object. We examine several behavioural patterns of nodes with respect to their willingness to
cooperate in the search. We derive mathematical models for the search process based on the
balls and bins model, as well as known approximations for the rumour-spreading problem. All
models are validated with simulations. We also evaluate the performance of the algorithm and
examine the impact of search parameters.

1 Introduction



The term 'gossiping algorithm' encompasses any communication algorithm where messages
between two nodes are exchanged opportunistically, with the intervention of other nodes that act as
betweeners or forwarders of the message. It is inspired by the social sciences, in the same way that
epidemic protocols were inspired by the spreading of infectious diseases [1]. These two
communication paradigms are very similar; they differ mainly in how gossiping nodes and infected
nodes behave: gossiping nodes adopt human-like characteristics, while
the behaviour of infected nodes is governed by the dynamics of the virus or disease.

Gossiping algorithms are suitable for communication in distributed systems, such as ad-hoc 
networks and generally systems with peer-to-peer or peer-to-multipeer communication. The latter 
communication paradigm is followed here, where a node can communicate with multiple peers, 
usually maintaining a short-time connection with each one. Attractive characteristics of gossiping 
algorithms include simplicity, scalability and robustness to failures, as well as a speed of
dissemination that is easily configurable. Gossiping can be identified with the spreading of rumours in a
network, the dynamics of which are investigated in [6], [5]. Additionally, gossiping protocols have
been used for the computation of aggregate network quantities, such as sums, averages, or quantiles 
of certain node values [3]. 

In all of the above referenced works, gossiping is typically used for the dissemination of in- 
formation, and performance metrics are oriented towards measuring the efficiency of information 
dissemination. In this paper, we model a specific gossip-based algorithm that aims at finding files 
(or data objects, in general) in nodes in a distributed network. The algorithm employs sequentially- 
generated parallel search procedures, in the following manner: We assume there exists a file in the 
network that may be located in different nodes. An initiator node is interested in this file and 
starts a round of searches to find it, by randomly querying one or more of its peers (neighbours). 
The queried nodes can also be activated and look for the object. The search is considered successful 
when at least one copy of the file is found. 

Apart from the initiator, the other nodes that assist the search can have different behavioural 
patterns. We distinguish between cooperative and non-cooperative nodes. Nodes in the first
category always become active when queried, and themselves generate query messages in subsequent
rounds. Non-cooperative nodes, on the other hand, are unwilling to participate in the search
process themselves. We also consider stifler nodes; the term is borrowed from [5] and signifies nodes
that were previously active, but from a certain point on lose interest in the dissemination of the 
query, and thus cease to participate in the search. Hence, it is a special case of cooperation. To 
avoid confusion in the paper, non-stifler nodes that are non-cooperative are also referred to as 
plain non-cooperative nodes. In all the above cases cooperation is considered only with respect to 
participating in the search; if a node has the file it always returns it. 

We also derive different versions of the algorithm based on the level of knowledge that each 
node has about the progress of the search. We consider two extremes: at one extreme, each node has no
knowledge whatsoever about the number or identities of nodes that have been previously queried 
in the network. At the other extreme, each node has complete knowledge about these facts and 
avoids sending messages to previously queried nodes at subsequent rounds. We call these cases 
blind search and smart search, respectively. 

We mathematically model the blind search process based on a known approximation for the 
rumour spreading problem [6], which we extend here. The smart search process is modeled using 
a combinatorial approach, based on a generalization of the balls and bins model [2]. Based on 
these models, we are able to evaluate the performance of the search, as well as the impact of search 
parameters. The latter are the number of queried neighbours by a node and the number of copies 
of the file in the network. By changing these parameters, one can easily configure the speed and 
efficiency of the search, as will be shown later. 

Both the blind and smart versions of the gossiping algorithm have been modeled exactly in 
[8]. Both information dissemination and search are investigated in that paper, while here we are 
focusing on the search process. The main algorithmic difference in [8] is that even the nodes that 
have the file can be non-cooperative. In this paper, we want to focus only on the effect that 
cooperation has in the forwarding of the message: only the intermediate nodes forwarding a query 
can be non-cooperative. In addition, the approximate model presented here for the blind search
process is shown to be computationally simpler, while maintaining good accuracy. Finally, in this 
paper we present the stifling behavioral pattern, which is not included in [8]. 
The paper is structured as follows. In Section 2 the gossip-based search algorithm is described
in more detail, and application scenarios that justify the study of the blind and smart search
algorithms are discussed. In Sections 3 and 4 we present the mathematical modeling of the blind
and smart search algorithms, respectively. The modeling in these sections covers cooperative and
plain non-cooperative nodes. The stifling behavioural pattern is analysed in Section 5. In Section 6
we present results for the performance of the blind search algorithm for very large numbers of nodes,
and derive useful scaling laws. The major conclusions from this work and issues for future research
are presented in Section 7.

2 Gossip-based search algorithm 

A model of the network in the form of a complete graph is considered. There is an initiator node Z,
and $N-1$ other nodes in the graph. There is a file $f$ located in $m$ of the other nodes of the graph
($m < N-1$) that the initiator wants to find. The initiator starts a search by randomly querying
a subset of its neighbors of size $k$ ($k < N-1$), with equal probabilities.

If a queried node has the object then it returns it, and the query is successful. Otherwise 
the queried nodes - depending on being cooperative or not - may begin to search themselves by 
forwarding the query to their neighbours. Cooperative nodes which are queried become "active" 
and participate in the search. The process of the search is modeled in steps or rounds, where at 
each round all active nodes simultaneously query their neighbors, hence activating new nodes. The 
algorithm continues for several rounds where at each step, active nodes randomly query some of 
their neighboring nodes, until the file $f$ is found.

We consider two search scenarios: 

- Blind search: An active node searches "blindly" at each round, possibly querying nodes 
that have been queried before. This approach can model devices with small computational 
capabilities, that cannot keep a log of queried nodes, or cases where the identities of the 
devices are not known. It is equally appropriate to model situations with random encounters 
between nodes. For instance, a number of mobility models have exponential meeting times 
between mobile nodes (such as the Random Walk, Random Waypoint and Random Direction 
models, as well as more realistic, synthetic models based on these [7]). In our model, the time 
until a node is queried approaches a geometric distribution, which is the discrete time analog 
to an exponential distribution. 

- Smart search: An active node searches "smartly" at each round, by avoiding nodes that have 
been queried before either by itself or by other nodes. This demands the knowledge of the 
identities of all queried nodes, and has a larger overhead compared to the blind search case. 
We do not define the exact algorithm by which the identities of all queried nodes are made 
known to an active node. We only assume that this knowledge can be obtained at a cost 
that is small compared to the cost of searching, and use this case mainly as a reference for 
the efficiency of the blind search algorithm. It is evident that smart search corresponds to 
the fastest version of the algorithm. Although it can be hard and costly to implement, there 
can exist schemes that can approximate its performance. For example, a low-cost algorithm 
that could approximate smart search can be based on the routine that, at each peer-to-peer 
communication, nodes exchange the lists of peers they have queried. 
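
The two scenarios can be illustrated with a small Monte Carlo sketch. This is not the simulator used for the results in this paper; the function names, the convention that node 0 is the initiator, and the choice to count the nodes activated in the discovery round are assumptions made for illustration. For smart search, it is assumed that active nodes avoid only the nodes queried in earlier rounds, so that two active nodes may still pick the same fresh node within a round.

```python
import random

def simulate_search(N, k, m, c=1.0, smart=False, rng=None):
    """Run one search on a complete graph; return (rounds, active nodes at discovery)."""
    rng = rng or random.Random()
    holders = set(rng.sample(range(1, N), m))  # file copies sit on non-initiator nodes
    active = {0}                               # node 0 plays the initiator (assumption)
    queried = {0}                              # only consulted by smart search
    rounds = 0
    while True:
        rounds += 1
        # Smart search: only nodes not queried in earlier rounds are eligible.
        fresh = [v for v in range(N) if v not in queried]
        hit = set()
        for node in active:
            pool = fresh if smart else [v for v in range(N) if v != node]
            hit.update(rng.sample(pool, min(k, len(pool))))
        queried |= hit
        found = bool(hit & holders)            # a holder always returns the file
        for v in hit - active:
            # One Bernoulli trial per queried node, however many queries it received.
            if v not in holders and rng.random() < c:
                active.add(v)
        if found:
            return rounds, len(active)

def mean_rounds(N, k, m, c=1.0, smart=False, runs=2000, seed=0):
    """Average number of rounds over independent runs."""
    rng = random.Random(seed)
    return sum(simulate_search(N, k, m, c, smart, rng)[0] for _ in range(runs)) / runs
```

Calling `mean_rounds(50, 1, 1)` and `mean_rounds(50, 1, 1, smart=True)` gives rough estimates for the two scenarios that can be set against the analytical models of the following sections.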
3 Approximate blind search model with cooperative or non-cooperative nodes



Each node that receives a search query will cooperate to forward the query with probability $c$
($0 \le c \le 1$). If several active nodes query the same node, the latter node decides whether to be
cooperative or not by a single Bernoulli trial. We do not consider that the node performs several 
independent trials, one for each query. 

We consider a sequence of steps (or rounds) $r = 1, 2, \ldots$ until the file is found. If at step $r$ there
are $A(r)$ active nodes then, provided the file is not yet found, the probability of finding it at the
$r$th step, $S(r)$, is:

$$S(r) = 1 - (1 - p_s)^{A(r)}, \tag{1}$$

where $p_s$ is the probability that a single search (consisting of $k$ different random queries) succeeds.

To find $p_s$, notice that the problem is equivalent to the one where, in a set of $N-1$ nodes, there
are $m$ marked nodes and we randomly select a group of $k$ nodes. We want to find the probability
that at least one marked node is selected. The probability that our selection returns exactly $u$
marked nodes ($0 \le u \le \min(m, k)$) is

$$p_u = \frac{\binom{m}{u}\binom{N-1-m}{k-u}}{\binom{N-1}{k}}.$$

Indeed, the marked nodes can be chosen in $\binom{m}{u}$ different ways, the unmarked ones in
$\binom{N-1-m}{k-u}$ ways, and the total number of ways to select $k$ nodes is $\binom{N-1}{k}$.
Further, $p_s = 1 - p_0$, therefore

$$p_s = 1 - \frac{\binom{N-1-m}{k}}{\binom{N-1}{k}}. \tag{2}$$
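
As a quick sanity check of (2), the expression can be evaluated by hand for $N = 50$, a network size used in the numerical examples of this paper. The arithmetic below is only an illustration.

```latex
% k = 1, m = 1
p_s = 1 - \frac{\binom{48}{1}}{\binom{49}{1}} = 1 - \frac{48}{49} = \frac{1}{49}
% k = 3, m = 1
p_s = 1 - \frac{\binom{48}{3}}{\binom{49}{3}} = 1 - \frac{46}{49} = \frac{3}{49}
```

More generally, for $k = 1$ the expression reduces to $p_s = m/(N-1)$.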

The probability of finding the file $f$ at the $r$th step is

$$p(r) = S(r)\prod_{i=1}^{r-1}\left(1 - S(i)\right). \tag{3}$$

This formula is an approximation because it implicitly assumes that each round is independent of
the other.

A deterministic approximation $A(r)$ for the number of active nodes in each round can be found
using the method presented in [6], which is extended to $k$ neighbours that are cooperative with
probability $c$. Consider the process $\{I(r), r \ge 1\}$ of the number of inactive nodes in each round.
Given that at round $r$ there are $i(r)$ inactive nodes and $A(r)$ active ones, the mean number of
inactive nodes at round $r + 1$ will be

$$E[I(r+1)] = i(r)\left[(1-c) + c\left(1 - \frac{k}{N-1}\right)^{A(r)}\right].$$

For fixed $k$ and large $N$, we use a second order expansion of $\left(1 - \frac{k}{N-1}\right)^{A(r)}$, so that
$\left(1 - \frac{k}{N-1}\right)^{A(r)} \approx e^{-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)}$, and

$$E[I(r+1)] = i(r)\left[(1-c) + c\,e^{-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)}\right]. \tag{4}$$

From this, by assuming $I(r) = i(r)\ \forall r$, we derive the deterministic approximation

$$I(r+1) = I(r)\left[(1-c) + c\,e^{-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)}\right].$$

Using that $A(r) = N - I(r)$, we finally obtain the recursion:

$$A(r+1) = Nc + A(r)(1-c) - (N - A(r))\,c\,e^{-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)}, \tag{5}$$

with $A(1) = 1$. Since $A(r)$ is not an integer in general, we round it to the nearest integer, which
we denote by $[A(r)]$.

Following a similar approach as in [6], it can be shown that the distribution of $I(r+1)$, given
$i(r)$, is indeed concentrated sharply around
$i(r)\left[(1-c) + c\exp\left(-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)\right)\right]$, and that
the approximation becomes more accurate as $k/N \to 0$.

From (3), we have constructed an approximate distribution for the number of steps until the
file is found. We can then derive the mean number of steps until the file is found and the mean
number of nodes $A$ involved in the search (activated nodes):

$$E[r] = \sum_{r=1}^{\infty} r\,p(r), \tag{6}$$

$$E[A] = \sum_{r=1}^{\infty} A(r+1)\,p(r). \tag{7}$$

(Notice that $A(r+1)$ nodes will be activated approximately at round $r$.)

For numerical calculations, as an upper bound on the support of $r$ we take

$$\min\left\{r : \prod_{i=1}^{r-1}\left(1 - S(i)\right) < \epsilon\right\}, \tag{8}$$

where $\epsilon$ is a number close to zero.
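
The steps (1)-(8) can be combined into a short numerical routine. The following is a minimal sketch under stated assumptions, not the implementation used for the results in this paper: the function names are hypothetical, the recursion (5) is iterated on the unrounded value of $A(r)$, and the nearest-integer rounding is applied only where $A(r)$ enters (1) and (7).

```python
from math import comb, exp

def single_search_success(N, k, m):
    """p_s of Eq. (2): a search of k distinct nodes hits at least one of m copies."""
    return 1.0 - comb(N - 1 - m, k) / comb(N - 1, k)

def blind_search_model(N, k, m, c=1.0, eps=1e-6):
    """Return (E[r], E[A], total probability mass) from Eqs. (1)-(8)."""
    p_s = single_search_success(N, k, m)
    rate = k / (N - 1) + k ** 2 / (2 * (N - 1) ** 2)  # second order expansion
    A = 1.0            # A(1) = 1: only the initiator is active
    survive = 1.0      # product of (1 - S(i)) over earlier rounds
    mean_rounds = mean_active = total = 0.0
    r = 1
    while survive >= eps:                              # truncation rule of Eq. (8)
        S = 1.0 - (1.0 - p_s) ** int(A + 0.5)          # Eq. (1) with rounded A(r)
        p_r = S * survive                              # Eq. (3)
        A_next = N * c + A * (1 - c) - (N - A) * c * exp(-A * rate)  # Eq. (5)
        mean_rounds += r * p_r                         # Eq. (6)
        mean_active += int(A_next + 0.5) * p_r         # Eq. (7)
        total += p_r
        survive *= 1.0 - S
        A = A_next
        r += 1
    return mean_rounds, mean_active, total
```

For instance, `blind_search_model(50, 1, 1)` evaluates the model for $N = 50$, $k = 1$, $m = 1$ and $c = 1$; the third returned value should be close to one and indicates how much probability mass the truncation (8) retains.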

Remark 1. In [8], we have derived an exact model for a slightly different version of the blind search
algorithm. Generally, the exact approach for modeling the search algorithm requires the calculation
of the $N \times N$ transition matrix $Q$, where the $(i, j)$-th value is the probability of going from $i$ to
$j$ active nodes in one round. Then $Q^r(1, i)$ denotes the probability that there are $i$ active nodes in
$r$ rounds.

The probability of finding the file in $r$ rounds, denoted here by $B(r)$, can be calculated as

$$B(r) = \sum_{i=1}^{N}\left[1 - \frac{\binom{N-i}{m}}{\binom{N-1}{m}}\right] Q^r(1, i). \tag{9}$$
This is a cumulative probability distribution, therefore the probability of finding the file exactly at round $r$ is
given by:

$$B(r) - B(r-1). \tag{10}$$

In the Appendix, we compare the complexity of the two models and examine the accuracy of the 
approximate one. It is shown that the reduction in computational cost is of the order of $O(N^2)$,
while the relative accuracy of the approximation is higher than 95% in the majority of cases (the 
comparison holds when c = 1). 

We validate our approximation by means of simulation. The simulations were performed with 
100 instances of random file positions, and 100 random executions of the search in each instance, 
leading to a total of $10^4$ repetitions in each experiment. We evaluate the mean number of rounds
and the mean number of activated (infected) nodes until at least one copy of the file $f$ is found,
varying the number of nodes $N$ in the graph, the cooperation probability $c$ and parameters $k$ and
$m$. The value of $\epsilon$ in (8) was set to $10^{-6}$. Results for different cases are shown in Fig. 1.

These figures illustrate that the simulation results match those from the theoretical analysis very
well, except for large values of $k$.¹ As the size of the network increases, the number of active nodes
increases linearly, while the mean number of rounds increases sublinearly, at a decreasing
rate, as shown in Fig. 1(a) and 1(b). This implies that the number of rounds can be well fitted
using a logarithmic function, as will be shown more clearly in Section 6.



We examine the impact of search parameters $k$, $m$ in Fig. 1(c)-1(f). Note that the search can
become faster by increasing either $m$ or $k$. By comparing Fig. 1(c) with 1(e), we note that the
increase in speed is higher for large values of $k$. However, Fig. 1(d) and 1(f) show that increasing $k$
has the disadvantage of increasing the number of active nodes, and thus produces higher commu- 
nication overhead. For example, calculations based on the simulation results show that for c = 1 
and N = 50 nodes, increasing m to 3 yields a relative decrease in the mean number of rounds 
by 31%, and in the mean number of active nodes by 45%. On the other hand, increasing k to 
3 yields a higher relative decrease in the mean number of rounds by 48%, but an increase in the 
mean number of active nodes by 14%. Overall, we remark that increasing the number of queried 
neighbours results in a great redundancy in the number of nodes that participate in the search with 
only small gains in speed. 



4 Analysis for the smart search case with cooperative or non-cooperative nodes

An analysis for the smart search process is presented below. An approximate analysis similar to 
the one for the blind search model fails here, due to the varying probabilities of successful query. 
Instead we adopt a direct combinatorial approach, by considering a generalization of the occupancy 
(balls and bins) problem [2]. 

The generalized balls and bins problem is defined as follows: In a population of n bins, suppose 
we randomly distribute r groups of k balls, such that in each group, no two balls go in the same 
bin and successive distributions of groups of balls are independent. We want to find the probability 
that exactly v bins remain empty, where v = 0, 1, . . . , n — k (it is assumed that n > k). 

1 It is noted that the improvement in accuracy by using the exact expression $\left(1 - \frac{k}{N-1}\right)^{A(r)}$, or adding more terms
in its expansion, is negligible.
Figure 1: Analytical and simulation results for the mean number of rounds and mean number of
active nodes with varying network parameters, for the blind search algorithm 



We follow the approach in [2, Section IV.2] for the classical occupancy problem (where $k = 1$).
The total number of ways to distribute $r$ groups of balls in the way described above is $\binom{n}{k}^r$.
Similarly, the total number of ways to assign them to $n - 1$ bins is $\binom{n-1}{k}^r$, so that the probability
that one given bin is empty is $\binom{n-1}{k}^r / \binom{n}{k}^r$. Generally, the probability that $v$ given bins are empty
is

$$\binom{n-v}{k}^r \Big/ \binom{n}{k}^r.$$

Therefore, the probability that at least one bin is empty is, by the inclusion-exclusion method,

$$\sum_{i=1}^{n-k} (-1)^{i+1}\binom{n}{i}\frac{\binom{n-i}{k}^r}{\binom{n}{k}^r}.$$

The probability that all bins are occupied, denoted by $p_0(r, k, n)$, is

$$p_0(r, k, n) = \sum_{i=0}^{n-k} (-1)^{i}\binom{n}{i}\frac{\binom{n-i}{k}^r}{\binom{n}{k}^r}. \tag{11}$$

Consider now the case where exactly $v$ non-given bins are empty. These $v$ bins can be chosen in
$\binom{n}{v}$ different ways. The $k$ balls of each of the $r$ groups are distributed among the remaining $n - v$
bins such that exactly $n - v$ are occupied. The number of such distributions is

$$\binom{n-v}{k}^r p_0(r, k, n-v).$$

Dividing by the total number of possible configurations we obtain the probability $p_v(r, k, n)$
that exactly $v$ bins are empty:

$$p_v(r, k, n) = \binom{n}{v}\frac{\binom{n-v}{k}^r}{\binom{n}{k}^r}\,p_0(r, k, n-v) = \binom{n}{v}\sum_{i=0}^{n-v-k} (-1)^{i}\binom{n-v}{i}\frac{\binom{n-v-i}{k}^r}{\binom{n}{k}^r}. \tag{12}$$
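
The generalized occupancy probability (12) is easy to evaluate exactly with rational arithmetic. The sketch below is an illustration only; the function name `p_empty` is hypothetical, and it assumes $r \ge 1$ and $n \ge k$.

```python
from fractions import Fraction
from math import comb

def p_empty(v, r, k, n):
    """Probability that exactly v of n bins stay empty after r groups of k balls, Eq. (12)."""
    if v < 0 or v > n - k:
        return Fraction(0)          # at least k bins are always occupied
    total = comb(n, k) ** r         # all ways to place the r groups
    # Inclusion-exclusion over the n - v bins that must all be occupied.
    s = sum((-1) ** i * comb(n - v, i) * comb(n - v - i, k) ** r
            for i in range(n - v - k + 1))
    return Fraction(comb(n, v) * s, total)
```

With a single group ($r = 1$) exactly $n - k$ bins remain empty, so for example `p_empty(3, 1, 2, 5)` equals one, and the probabilities over $v = 0, \ldots, n - k$ sum to one for any valid parameters.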

Based on (12), we find transition probabilities of the form $p(x_i, x_j)$, which denotes the
probability that if at a certain round of the algorithm there are $x_i$ active nodes, then at the next round
there will be $x_j$ active ones ($x_j \ge x_i$). It is emphasized here that each round corresponds to one
transition. In our terminology, "at (or in) a certain round" will have the meaning "after the
transition that occurred in this round and before the next transition". The first round marks the transition
from 1 active node (the initiator) to a maximum number of $k + 1$ active nodes.

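Since there are no repetitions for the smart search, the transition probabilities can be found by
directly applying (12), substituting $n = N - x_i$, $v = N - x_j$, and $r = x_i$:

$$p(x_i, x_j) = p_{N - x_j}(x_i, k, N - x_i). \tag{13}$$

From (12), (13) we have

$$p(x_i, x_j) = \binom{N - x_i}{N - x_j}\sum_{t=0}^{x_j - x_i - k} (-1)^{t}\binom{x_j - x_i}{t}\frac{\binom{x_j - x_i - t}{k}^{x_i}}{\binom{N - x_i}{k}^{x_i}}. \tag{14}$$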
Each of the $x_j - x_i$ queried nodes will decide whether or not to be cooperative independently
with probability $c$. The probability that $a$ out of $x_j - x_i$ nodes will actually be activated is

$$B(x_j - x_i, a, c) = \binom{x_j - x_i}{a} c^{a} (1 - c)^{x_j - x_i - a}.$$

Therefore, the probability that there will be $x_i + a$ active nodes in the next round is

$$p(x_i, x_i + a) = p(x_i, x_j)\,B(x_j - x_i, a, c). \tag{15}$$

Based on the transition probabilities, we construct the $N \times N$ transition matrix $Q$ with entries
$p(x_i, x_j)$ for $i, j = 1, \ldots, N$. The value of the $i$th element of the first row of the matrix $Q^r$ is the
probability that there are $i$ active nodes at round $r$.
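
A row of the transition matrix can be assembled from (12)-(15) as sketched below. This is an illustration under assumptions that are not stated in the text: the contributions of (15) are summed over all intermediate values $x_j$, and when fewer than $k$ nodes remain unqueried the active nodes are assumed to query all of them. The names `p_empty` and `transition_row` are hypothetical.

```python
from fractions import Fraction
from math import comb

def p_empty(v, r, k, n):
    """Eq. (12): exactly v of n bins stay empty after r groups of k balls."""
    if v < 0 or v > n - k:
        return Fraction(0)
    s = sum((-1) ** i * comb(n - v, i) * comb(n - v - i, k) ** r
            for i in range(n - v - k + 1))
    return Fraction(comb(n, v) * s, comb(n, k) ** r)

def transition_row(x_i, N, k, c):
    """Probabilities of moving from x_i active nodes to 0..N active nodes."""
    c = Fraction(c)
    row = [Fraction(0)] * (N + 1)      # index = number of active nodes
    n = N - x_i                        # bins: nodes not yet queried
    if n == 0:
        row[x_i] = Fraction(1)         # nobody left to query
        return row
    kk = min(k, n)                     # assumption: query all remaining nodes
    for x_j in range(x_i + kk, N + 1):
        p = p_empty(N - x_j, x_i, kk, n)           # Eq. (13)
        q = x_j - x_i                              # newly queried nodes
        for a in range(q + 1):                     # binomial thinning, Eq. (15)
            row[x_i + a] += p * comb(q, a) * c ** a * (1 - c) ** (q - a)
    return row
```

For $c = 1$ the first round is deterministic, so `transition_row(1, N, k, 1)` puts all its mass on $k + 1$ active nodes, in line with the remark on the first round above.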

With $v$ active nodes, the probability that at least one of them finds a copy of the file is
$1 - (1 - p_s(v))^{v}$, where $p_s(v)$ is the probability that a search by a single node finds a copy of
the file, given that there are already $v$ active nodes. The probability $S(r)$ of finding the file by
(and including) round $r$ is

$$S(r) = \sum_{v} Q^{r-1}(1, v)\left(1 - (1 - p_s(v))^{v}\right). \tag{16}$$

To find $p_s(v)$, we take a similar approach as in Section 3. It is

$$p_s(v) = 1 - \frac{\binom{N-v-m}{k}}{\binom{N-v}{k}}. \tag{17}$$

Finally, the probability of finding the file at the $r$-th round is given by (3).² We emphasize that
this formula is again an approximation, since it is assumed that each round is independent of the
other.

Based on the above distribution, we easily derive the expected number of rounds until a file is
found, as in (6). The mean number of nodes activated during the search process is

$$E[A] = \sum_{r=1}^{\infty} E[a(r)]\,p(r), \tag{18}$$

where $E[a(r)]$ is the mean number of active nodes in round $r$, derived from the distribution $Q^r(1, \cdot)$.
(For the smart search, the summation index in (6), (18) is upper bounded by $\lceil (N-1)/k \rceil$.)

We take both analytical and simulation results for the same values of the parameters N, c, k 
and m as for the blind search algorithm. The simulations are conducted for the same number of 
repetitions as for blind search. Results are shown in Fig. 2. Generally, the model is extremely
accurate for $c = 1$, but as the cooperation probability decreases it starts to deviate from the
simulated behavior. For small values of $c$, we remark that the model is less accurate than our
model for blind search, even though it follows a combinatorial approach that is exact up to (16).
We attribute this to the fact that the intermediate search steps are more correlated for the smart 
search algorithm. Thus, as it is intuitively reasonable, the assumption of independence over rounds 
leads to worse results for the smart search than for the blind search case. 



2 Notice that (16) is not a cdf, so we cannot use $S(r) - S(r-1)$ to find the probability of a successful search at round $r$.

Figure 2: Analytical and simulation results for the mean number of rounds and mean number of
active nodes with varying network parameters, for the smart search algorithm 



The same observations hold regarding the effect of the cooperation probability, the number of 
queried neighbours and the number of copies of the file, as in the blind search case. It is again
emphasized that the behavior with respect to increasing k is an outcome of the tradeoff between 
the increased speed of discovery and the redundancy in the total number of messages sent. 

We notice that there is only a small performance improvement of the smart over the blind search 
algorithm, expressed through the decrease in the mean number of rounds. This improvement 
becomes less pronounced as k or m increase. For example, based on the simulation results for 
N = 50 and c = 1, the relative reduction of smart search in the mean number of rounds is 13% 
when k = 1, m = 1, 3% when k = 1, m = 3, and 5% when k = 3, m = 1. This improvement
is relatively larger when the cooperation probability decreases: for N = 50 and c = 0.5, the 
corresponding reductions were 27%, 16% and 6%. 

However, the mean number of active nodes may be greater for the smart search, due to the fact 
that we always query only inactive nodes. This was true in most of the derived results. For N = 50 
nodes, the relative increase reached up to 10% for c = 1 and k = 1, m = 3, while for c = 0.5 it
increased up to 31% for k = 1, m = 3. For the values of N = 50 and c = 0.5, only a slight decrease 
of 1% was observed when k = 3, m = 1. 

The overall results illustrate that when comparing the two cases, the smart search does not 
offer a significant improvement. This leads us to the conclusion that if the overhead incurred in 
the smart search algorithm for informing all active nodes of the identities of queried nodes is not 
negligible compared to that of the search procedure, it is highly likely that there is not much to 
gain by such a scheme. 

5 Analysis with stiflers 

Another behavioral pattern that we consider is stifling. In this pattern, each of the nodes that 
are (or become) active at a certain round may cease to be active and not participate in the search 
process any more. This could express a node's loss of interest in spreading the query message 
further in the network. 

We analyse this stifling behaviour based on the assumption that at each round of the search, 
each active node may become a stifler independently with probability s. A node that becomes a 
stifler is considered as inactive, and in a blind search it may become active again with probability 
1 — s, if queried. We consider that the initiator does not become a stifler, so the number of active 
nodes will always be greater than zero. 

This stifling behaviour will be modeled only for the blind search case, based on our approximate
method. In order to model the smart search case, one has to discriminate between active and
queried nodes, which leads to a multi-dimensional Markov chain which is not easily amenable to 
analysis. 

Figure 3: Mean number of rounds and mean number of active nodes upon discovery with $k = 1$,
$m = 1$ for the blind search algorithm with stiflers

Given that there are $i(r)$ inactive nodes at round $r$, we are interested in finding the mean
number of inactive nodes at round $r + 1$. This consists of the mean number of active nodes at
round $r$ that became inactive (excluding the initiator) and the mean number of inactive nodes at
round $r$ that remained inactive. Hence,

$$E[I(r+1)] = (A(r) - 1)s + i(r)\left[s + (1-s)\left(1 - \frac{k}{N-1}\right)^{A(r)}\right].$$

Assuming $I(r) = i(r)\ \forall r$, and using again that
$\left(1 - \frac{k}{N-1}\right)^{A(r)} \approx e^{-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)}$ and $A(r) = N - I(r)$,
we finally obtain the deterministic approximation

$$A(r+1) = 1 + (N-1)(1-s) - (N - A(r))(1-s)\,e^{-A(r)\left(\frac{k}{N-1} + \frac{k^2}{2(N-1)^2}\right)}, \tag{19}$$

with $A(1) = 1$.

From there we can follow similar steps as in Section 3 to find performance measures of interest.

Results based on simulation and the analytical approximation are shown in Fig. 3 for different
values of parameters $m$, $k$, and the stifling probability $s$. As $s$ increases, the performance of the
search algorithm deteriorates.
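
The recursion (19) can be plugged into the same procedure as in Section 3 to obtain the mean number of rounds. The sketch below is only an illustration with a hypothetical function name; as an assumption, the recursion is iterated on the unrounded value of $A(r)$ and the rounding is applied where $A(r)$ enters (1).

```python
from math import comb, exp

def blind_search_with_stiflers(N, k, m, s, eps=1e-6):
    """Mean number of rounds from Eqs. (1), (3), (6), (8) with the recursion (19)."""
    p_s = 1.0 - comb(N - 1 - m, k) / comb(N - 1, k)   # Eq. (2)
    rate = k / (N - 1) + k ** 2 / (2 * (N - 1) ** 2)  # second order expansion
    A = 1.0          # A(1) = 1: the initiator, which never becomes a stifler
    survive = 1.0    # probability that the file has not been found so far
    mean_rounds = 0.0
    r = 1
    while survive >= eps:                             # truncation rule of Eq. (8)
        S = 1.0 - (1.0 - p_s) ** int(A + 0.5)         # Eq. (1) with rounded A(r)
        mean_rounds += r * S * survive                # Eqs. (3) and (6)
        survive *= 1.0 - S
        A = 1 + (N - 1) * (1 - s) - (N - A) * (1 - s) * exp(-A * rate)  # Eq. (19)
        r += 1
    return mean_rounds
```

With $s = 1$ only the initiator ever searches, so the routine returns a value close to $1/p_s$, which for $k = 1$ is $(N-1)/m$.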

We observe that the approximate model follows very well the simulated behaviour, except for 
larger deviations in the mean number of rounds when the stifling probability is high. This is 
mainly due to the higher relative error that results from the rounding operation (see the analysis 
of Section 3). As the stifling probability gets higher, the number of active nodes in the network
is rounded to one in the model, and therefore the mean number of rounds approaches the inverse
of the probability of a successful query of this node (geometric distribution). For example, when
$k = 1$, the mean number of rounds approaches the value of $(N-1)/m$.

A comparison of the speed of blind search between the stifling and plain non-cooperative case
shows that the search algorithm performs worse in the presence of stifler nodes. We may remark
from the results that the relative increase in the number of rounds when nodes behave as stiflers
- rather than as plain non-cooperative nodes - becomes greater when the number of nodes in the
network increases, or when the stifling probability increases. Since stifling is opposite to cooperation,
it makes sense to examine dual values of $s$, $c$, i.e. such that $s = 1 - c$ holds. For $k = 1$, $m = 1$,
and $N = 50$, the mean number of rounds is increased by 9% for nodes that behave as stiflers with
probability $s = 0.2$, compared to the case of plain non-cooperative nodes with $c = 0.8$. The relative
increase is 5% when $N = 10$. When $s = c = 0.5$, the corresponding relative increase is much
greater, and amounts to 78%.

Figure 4: Comparison of smart search against blind search with stiflers for $k = 1$, $m = 1$, with
different values of the stifling probability and increasing number of nodes.

This difference becomes smaller as the search becomes faster, i.e. when increasing either of the
parameters $k$ or $m$. Regarding the relative influence of the parameters $k$, $m$ on the efficiency of the
search, the same observations hold as in all previous cases.

The mean number of active nodes calculated here is the mean number of nodes that are active 
upon discovery of the file. We therefore do not count nodes which were previously active in the 
search. Hence, it should be noted that the mean number of active nodes displayed here is only 
indicative of the communication overhead, as it does not count the nodes that were active in 
intermediate rounds of the algorithm, and hence the corresponding communication costs. Generally, 
our findings show that this number is much smaller when compared to the plain non-cooperative 
case, where active nodes remain in that state until the end. For $k = 1$, $m = 1$, and $N = 50$,
the number of active nodes upon discovery is decreased by 39% in the stifling case with s = 0.2, 
compared to the plain non-cooperative case with c = 0.8. 

We have also conducted a series of simulations to see the performance of smart search in the 
presence of stifling nodes. The smart search algorithm only queries nodes that have not been 
queried in previous rounds, although all active nodes may become stiflers at any round. In Fig. 4
we show a comparison of smart search against blind search for k = 1 ,m = 1, with different values 
of the stifling probability and increasing number of nodes. 

We observe that smart search can yield a reduction in the number of rounds to discover the 
file, which becomes significant for high values of the stifling probability. For $s = 0.8$ and $N = 50$,
the relative reduction is 37%. However, the interesting thing is that for smaller values it may also 
yield a slight increase (see the curves in Fig. 4(a) for s = 0.4). This seemingly unorthodox result is 
explained by the fact that since previously queried nodes are not queried again, once they become 
stiflers, they permanently remain in that state and do not participate again in the search. Thus
trying to implement a "smarter" algorithm may also result in reducing the effective number of 
searchers, thereby slowing down the search. 

Fig. 4(b) shows that the mean number of active nodes upon discovery is smaller for smart search
than for blind search, when $s > 0$ (as opposed to the plain non-cooperative case, cf. Figs. 1 and 2). The
relative decrease becomes greater for medium values of $s$ ($s = 0.4$ in Fig. 4(b)). For high values of
this probability, the mean number of active nodes approaches one and differences become negligible. 
On the other hand, from Fig. 4(a) the highest gains in speed occur for s = 0.8. Therefore the highest 
gain in speed does not imply the highest reduction in redundant active nodes, and vice-versa. 



6 Scaling performance of blind search 

The low-complexity approximate model we have developed for the blind search algorithm enables
us to study its performance for networks with very large numbers of nodes. We have obtained results
for networks with up to $10^5$ nodes, for both behavioural profiles: plain non-cooperative and stifling.
In Fig. 5 we plot the mean number of rounds and the mean number of active nodes as a function
of $N$ for the case of plain non-cooperative nodes, while in Fig. 6 similar results are shown for the
case of stifling nodes.

Figure 5: Scaling performance of blind search in the plain non-cooperative case, for k = 1, m = 1:
(a) mean number of rounds, (b) mean number of active nodes



The x-axis in all plots is in log scale. Figs. 5(b) and 6(b) are in log-log scale. In Fig. 6(a), the
curve for s = 0.6 is scaled with respect to the right y-axis.

From these results, we observe that the scaling performance of blind search is remarkably simple.
In the plain non-cooperative case, the mean number of rounds increases linearly with log N, while
the mean number of active nodes increases linearly with N. This is true for almost the whole
range of values of c. (A more accurate estimate would be to consider a piece-wise linear function,
with a slightly smaller slope for N < 100.) In the case of stifling nodes, the behaviour is similar
for values of s < 0.5 approximately. For large values of s, the mean number of rounds increases
in proportion to the number of nodes, while the mean number of active nodes
upon discovery approaches 1.

Figure 6: Scaling performance of blind search in the stifling case, for k = 1, m = 1: (a) mean
number of rounds, (b) mean number of active nodes upon discovery



Based on the observed behaviour, we can easily derive fitted functions using the least
squares method. For example, for k = 1, m = 1 and c = 1, E[r] = 0.629 log N + 0.057 and
E[A] = 0.567N + 0.584.
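
As a minimal illustration of such a fit (a sketch, not the code used for the results above), ordinary least squares for a line y = a x + b can be written in a few lines of Python. The function name `fit_line` and the network sizes are hypothetical. The y-values are generated from the fitted expression quoted above rather than taken from the model's output, and the use of the natural logarithm is an assumption, since the base of log N is not stated.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical network sizes; the "observations" are synthetic points
# generated from the fitted expression in the text, NOT measured results.
sizes = [10, 100, 1_000, 10_000, 100_000]
log_sizes = [math.log(n) for n in sizes]      # natural log is an assumption
rounds = [0.629 * x + 0.057 for x in log_sizes]

a, b = fit_line(log_sizes, rounds)
print(f"E[r] ~ {a:.3f} log N + {b:.3f}")
```

Regressing against log N rather than N is what turns the logarithmic growth of the number of rounds into a straight-line fit; the same function applied to (N, E[A]) pairs would give the linear fit for the number of active nodes.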

7 Conclusions 

This paper has focused on the mathematical modeling of the gossip-based search algorithm in a
complete graph. We focused mainly on the blind-search algorithm, which is totally ignorant of
previous queries and can be implemented very easily, and compared its performance with that of
a smart-search algorithm, where previously queried nodes are avoided.

Several conclusions can be drawn concerning trade-offs between speed, cost, and redundancy
between "blind" and "smart" gossip-based search algorithms. We have investigated two extreme
cases; many intermediate algorithms can be studied with different levels of knowledge of previous
queries, trading speed of discovery against additional communication and processing overhead. An
important observation is that speed also trades off against the redundancy in the number of query
messages needed to locate a file. The results we obtained provide a strong indication that, when
nodes have a plain non-cooperative profile, the additional overhead of designing a "smarter" algorithm
is not worthwhile since, apart from the additional communication and processing cost, it induces
high redundancy in the number of messages in the network.

For both the blind and smart search cases, we showed that the mean number of active nodes and
the mean number of rounds increase roughly linearly with the number of nodes in the network and
its logarithm, respectively. For the blind search algorithm, we were able to confirm this behaviour
for very large numbers of nodes, using the approximate model, which has very low complexity.

Another important observation concerns the relative impact on the search of the number of
peers queried by each node, and of the number of copies of the data object in the network.
Increasing either of these parameters increases the speed of discovery. The relative increase is greater
when the number of queried peers increases, in both the blind and smart search algorithms. However,
the corresponding increase in the number of active nodes that (in most cases) ensues is
undesirable, and hence it is preferable to keep the value of this parameter very small. The
gossip-based search algorithm performs better when the requested data object is spread to many
nodes in the network.

Other useful remarks from this research concern the effects of different behavioural profiles:
cooperative, plain non-cooperative and stifling, with different degrees of cooperation. Stifling has
the greatest negative impact on the search performance, and this impact becomes worse in large networks.

Future research issues we envisage are mostly related to the performance evaluation of the gossip-based
search algorithm. The most significant direction of research is to examine the efficiency of
the search in different types of networks. From the complete graph that we
studied here, we can pass to more general graphs that are commonly encountered, such as Erdős-Rényi graphs
or graphs with a power-law degree distribution (scale-free networks). Performance results in such
networks, where each node is connected to a different set of peers, will give more realistic evidence of
the algorithm's efficiency and applicability. Finally, it would be interesting to compare the performance of the algorithm
with other distributed search schemes, in terms of speed and implementation cost.

A Comparison between the approximate and the exact model for
the blind search algorithm

We compare the approximate model for the blind search algorithm with the exact model developed 
in [8], when the cooperation probability c = 1. The two models are compared from the points of 
view of complexity and accuracy. 

A.1 Comparison of complexity

We compare the complexity of the two models based on the computational cost of deriving
the probability of locating the file at a certain round r, given by (3) in the approximate model and by
(9) in the exact model. The computational cost is measured by the number of elementary
steps needed to derive the location probability, where each step consists of a small number of
elementary operations (addition, subtraction, multiplication or division). We use the O-notation
as the asymptotic upper bound of the complexity.

To compute (3), equation (2) and the auxiliary expression over the active nodes must be computed first. Equation (2) involves two
binomial coefficients, which can be computed in O(kN) time using the well-known linear recursion
formula (see footnote 3). The recursive formula for A(r) involves a small number of multiplications and additions, and
an exponential function. The exponential can be calculated easily by splitting the exponent into
integer and fractional parts (the latter can be computed to high accuracy with only a few terms
of a Taylor expansion, see e.g. [1]). Hence each execution of the recursion has order one, and
computing A(r) takes time O(r). It holds that A(r) < N. Therefore, the computation of
the auxiliary expression and of (3), given their input parameters, takes time O(N) and O(r) respectively, since the first
involves the computation of A(r) polynomials and the second of r polynomials, all of
degree one. Therefore, the total complexity of deriving the probability of locating the file at round
r using the approximate model is O(kN + r).
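
To make the O(kN) cost of a binomial coefficient concrete, the following Python sketch evaluates C(n, k) using only additions. It assumes that the "linear recursion formula" mentioned above is the additive rule C(j, i) = C(j-1, i-1) + C(j-1, i) (Pascal's rule); the function name `binomial` is hypothetical and not part of the model.

```python
def binomial(n, k):
    """Compute C(n, k) with Pascal's rule, using about n*k additions."""
    if k < 0 or k > n:
        return 0
    # row[i] holds C(j, i) for the current j; only entries 0..k are needed.
    row = [1] + [0] * k
    for j in range(1, n + 1):
        # Update right to left so row[i - 1] still holds C(j - 1, i - 1).
        for i in range(min(j, k), 0, -1):
            row[i] += row[i - 1]
    return row[k]

print(binomial(5, 2))
```

The outer loop runs n times and the inner loop at most k times, so with n at most N the cost is O(kN) additions, in line with the bound stated above.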

To find the probability of locating at least one copy of the file with the exact model, we need to
calculate the r-th power of the transition matrix Q and then solve equations (9) and (10) sequentially.
The first computation involves the multiplication of N x N transition probability matrices, with
complexity at worst O(N^3) per multiplication. For sufficiently large r, we can consider the sequence of matrices Q,
Q^2, Q^4, ... (repeated squaring) instead of computing the sequence Q, Q^2, ..., Q^r, since the former reaches
Q^r in considerably fewer multiplications than the latter. Therefore, we can compute the matrix power
in O(ln(r) N^3) steps. The computation of Q involves two binomial coefficients,
which have complexity O(mN). Therefore, it takes O(mN^2) steps to solve (9). The
total computational complexity is thus dominated by the complexity of computing the matrix power,
which is O(ln(r) N^3) in our case.
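
The logarithmic number of matrix multiplications can be illustrated with a short Python sketch of exponentiation by repeated squaring. This is a generic textbook routine written for illustration, not the implementation used for the exact model; the names `mat_mul` and `mat_pow` and the 2 x 2 example matrix are hypothetical.

```python
def mat_mul(a, b):
    """Naive product of two n x n matrices: O(n^3) scalar operations."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(q, r):
    """Compute q**r for an integer r >= 0 with O(log r) matrix products."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in q]
    while r > 0:
        if r & 1:                      # this binary digit of r is set
            result = mat_mul(result, base)
        base = mat_mul(base, base)     # Q, Q^2, Q^4, Q^8, ...
        r >>= 1
    return result

# Toy 2-state transition matrix (hypothetical): state 1 is absorbing.
Q = [[0.5, 0.5],
     [0.0, 1.0]]
P = mat_pow(Q, 10)
print(P[0][1])   # probability of having been absorbed within 10 steps
```

Each pass of the loop performs at most two O(N^3) products and halves r, which gives the O(ln(r) N^3) bound quoted above.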

A.2 Comparison of accuracy

We next compare the relative accuracy of the approximate model for calculating the mean number
of steps to find at least one copy of the file and the mean number of nodes activated in the search.
The relative accuracy is calculated as (1 - |exact - approx| / exact) x 100% and is reported to two
decimal places. Recall that the comparison is made for a cooperation probability of one.
Results are shown in Table 1 below, where N stands for the total number of nodes in the network
(including the initiator).
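
For concreteness, the accuracy measure can be written as a one-line Python function. The function name and the numerical inputs below are hypothetical and serve only to show the arithmetic; they are not values from Table 1.

```python
def relative_accuracy(exact, approx):
    """Relative accuracy in percent, rounded to two decimal places."""
    return round((1 - abs(exact - approx) / exact) * 100, 2)

# Hypothetical values: an exact mean of 4.0 rounds approximated as 3.9.
print(relative_accuracy(4.0, 3.9))
```

Because the absolute error is used, the measure treats overestimates and underestimates of the same size identically.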

These results confirm that the approximation becomes more accurate when the number of nodes
in the network increases. The greatest inaccuracy is observed for a number of queried neighbours k
that is large relative to N, as was also indicated in Fig. 1(f). Generally, the model proves
to be credible, with a relative accuracy that is higher than 95% in the majority of the above cases.



3 For j > i > 0, \binom{j}{i} = \binom{j-1}{i-1} + \binom{j-1}{i}, with \binom{j}{0} = 1.

Table 1: Relative accuracy (%) of the approximate model for blind search



(a) Mean number of rounds 